threshold state
We define a Gibbs distribution associated with the problem and evaluate its normalization constant, the partition function Z. The partition function, and in particular its logarithm divided by the inverse temperature (up to a sign), the free energy, contains all the information we aim to extract from the problem. By taking the proper derivatives, and possibly adding an external field, we can compute relevant macroscopic properties such as the average overlap with the ground truth and the average loss achieved. In disordered systems we have to deal with the additional complication introduced by the randomness, which requires averaging the log-partition function over the disorder.
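As a schematic rendering of these definitions (our notation and normalizations, stated as assumptions: $\beta$ is the inverse temperature, $\mathcal{L}(w)$ the loss, $m(w, w^\star)$ the overlap with the ground truth, $h$ an external field coupled to it, and $\mathbb{E}$ the quenched average over the randomness):

```latex
% Partition function with an external field h coupled to the overlap,
% and the quenched free energy:
Z(\beta, h) = \int \mathrm{d}w \, e^{-\beta \left[ \mathcal{L}(w) - h N m(w, w^\star) \right]},
\qquad
f(\beta, h) = -\lim_{N \to \infty} \frac{1}{\beta N} \, \mathbb{E} \log Z(\beta, h).

% The macroscopic observables follow by differentiation:
\frac{\langle \mathcal{L} \rangle}{N} = \frac{\partial (\beta f)}{\partial \beta} \bigg|_{h = 0},
\qquad
\langle m \rangle = - \frac{\partial f}{\partial h} \bigg|_{h \to 0}.
```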
- Europe > France (0.05)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Europe > Italy (0.04)
- Europe > Switzerland (0.04)
- North America > Canada (0.04)
- Europe > Italy > Lazio > Rome (0.04)
- Europe > France > Île-de-France > Paris > Paris (0.04)
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
We kindly thank the reviewers (R1, …). We discuss the main points below.
- Why is the Hessian in Figure 1 not marginal before the transition? […]
- We will add to the supplementary material a version of Fig. 2 for a value of […].
- We will add clarifications in the text wherever it could give rise to confusion.
- Above the threshold this algorithm can find the solution.
- Why is the threshold energy higher for more samples? […]
Review for NeurIPS paper: Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Weaknesses: One of the main assumptions used in characterizing the "threshold states" at which the gradient-flow dynamics appear to get trapped is that the Hessian is positive semidefinite. On the other hand, in Figure 2, as the training loss crosses the threshold energy, the minimal eigenvalues of the Hessian appear clearly negative, unless I am misunderstanding the figure. The authors do not appear to address this point. Could the questionable soundness of this assumption also account for the inaccuracy of the computed value of alpha at which the phase transition occurs, visible in Figure 4? The main result of the paper, the computation of the relative sample size alpha at which the phase transition occurs, does not seem very accurate when compared to the experiments in Figures 3 and 5. It would also have been helpful to plot this value in the figures to make this point clear. The discrepancy could be a result of finite-size effects, as the authors claim, but it could also stem from the assumption made about the Hessian at the threshold states or from the accuracy of the 1RSB ansatz.
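A simple way to probe this point numerically is to evaluate the smallest Hessian eigenvalue of the loss along the trajectory. The sketch below is our own illustrative construction, not the authors' code, and assumes the loss convention L(w) = (1/4M) * sum_m ((x_m . w)^2 - y_m)^2:

```python
import numpy as np

# Illustrative probe (assumed convention, not the paper's code):
# for L(w) = (1/4M) * sum_m ((x_m . w)^2 - y_m)^2, the Hessian is
# Hess L(w) = (1/M) * sum_m (3 (x_m . w)^2 - y_m) x_m x_m^T.

def min_hessian_eigenvalue(X, y, w):
    """Smallest Hessian eigenvalue at w; a negative value means the
    Hessian is not positive semidefinite there."""
    z = X @ w
    H = (X.T * (3.0 * z**2 - y)) @ X / X.shape[0]  # (1/M) sum_m (3 z_m^2 - y_m) x_m x_m^T
    return np.linalg.eigvalsh(H)[0]                # eigenvalues sorted ascending

# Toy usage at a random configuration (alpha = M/N = 3; all sizes illustrative):
rng = np.random.default_rng(0)
N, M = 128, 384
X = rng.standard_normal((M, N))               # Gaussian sensing vectors
w_star = rng.standard_normal(N)
w_star /= np.linalg.norm(w_star)              # unit-norm ground truth
y = (X @ w_star) ** 2                         # noiseless quadratic measurements
w = rng.standard_normal(N)
w /= np.linalg.norm(w)                        # random unit-norm configuration
print(min_hessian_eigenvalue(X, y, w))
```

Tracking this quantity as the training loss crosses the threshold energy would directly test whether the positive-semidefiniteness assumption holds there.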
Review for NeurIPS paper: Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
The paper makes interesting contributions towards understanding non-convex optimization by studying a problem that is simple enough to allow for analytical calculations. Overall, there is a decent, well-supported agreement between theory and experiment (in particular, between the leading moments of the distribution of the threshold states as evaluated empirically and the computed moments). This paper is a valuable contribution to NeurIPS and should be accepted. We nevertheless recommend several directions along which the paper could be further improved to reach a wider audience, and we recommend that the authors revisit their author feedback before submitting the final version. First, the presentation is unusually difficult to follow from the perspective of the machine learning audience and could be improved by providing more background on the known results used in the paper (e.g., the BBP transition or replica theory), if necessary in the appendix.
From Zero to Hero: How local curvature at artless initial conditions leads away from bad minima
Tony Bonnaire, Giulio Biroli, Chiara Cammarota
We investigate the optimization dynamics of gradient descent in a non-convex and high-dimensional setting, with a focus on the phase retrieval problem as a case study for complex loss landscapes. We first study the high-dimensional limit where both the number $M$ and the dimension $N$ of the data go to infinity at fixed signal-to-noise ratio $\alpha = M/N$. By analyzing how the local curvature changes during optimization, we uncover that for intermediate $\alpha$, the Hessian displays a downward direction pointing towards good minima in the first regime of the descent, before the dynamics get trapped in bad minima at the end. Hence, the local landscape is benign and informative at first, before gradient descent brings the system into an uninformative maze. The transition between the two regimes is associated with a BBP-type threshold in the time-dependent Hessian. Through both theoretical analysis and numerical experiments, we show that in practical cases, i.e. for finite but even very large $N$, successful optimization via gradient descent in phase retrieval is achieved by falling towards the good minima before reaching the bad ones. This mechanism explains why successful recovery is obtained well before the algorithmic transition corresponding to the high-dimensional limit. Technically, this is associated with strong logarithmic corrections of the algorithmic transition at large $N$ with respect to the one expected in the $N\to\infty$ limit. Our analysis sheds light on this new mechanism facilitating gradient descent dynamics in finite large dimensions, and highlights the importance of initial conditions with good spectral properties for optimization in complex high-dimensional landscapes.
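For concreteness, here is a minimal sketch of gradient descent on a phase-retrieval loss of this type; the loss convention, step size, and sizes are our illustrative assumptions (the paper itself analyzes gradient flow in the $N \to \infty$ limit):

```python
import numpy as np

# Sketch under assumed conventions (not the authors' code):
# L(w) = (1/4M) * sum_m ((x_m . w)^2 - y_m)^2 with y_m = (x_m . w*)^2,
# Gaussian sensing vectors, plain gradient descent with a fixed step size.

rng = np.random.default_rng(1)
N = 256
alpha = 3.0                                    # sample ratio alpha = M / N
M = int(alpha * N)
lr, n_steps = 0.05, 3000

X = rng.standard_normal((M, N))                # rows are the sensing vectors x_m
w_star = rng.standard_normal(N)
w_star /= np.linalg.norm(w_star)               # unit-norm ground truth
y = (X @ w_star) ** 2                          # noiseless quadratic measurements

w = rng.standard_normal(N)
w /= np.linalg.norm(w)                         # random initialization

for _ in range(n_steps):
    z = X @ w
    w -= lr * (X.T @ ((z**2 - y) * z) / M)     # gradient of L(w)

z = X @ w
loss = np.sum((z**2 - y) ** 2) / (4 * M)
overlap = abs(w @ w_star) / np.linalg.norm(w)  # |m| in [0, 1]; the sign of w* is unrecoverable
print(f"final loss = {loss:.4f}, overlap |m| = {overlap:.3f}")
```

Depending on $\alpha$, $N$, and the initialization, the overlap either stays near zero (trapping in bad minima) or approaches one (recovery), mirroring the two regimes described in the abstract.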
- North America > United States (0.14)
- Africa > Middle East > Tunisia > Ben Arous Governorate > Ben Arous (0.04)
- Europe > Russia (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.97)
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval
Stefano Sarao Mannelli, Giulio Biroli, Chiara Cammarota, Florent Krzakala, Pierfrancesco Urbani, Lenka Zdeborová
In many machine learning applications one optimizes a non-convex loss function; this is often achieved using simple descent algorithms such as gradient descent or its stochastic variants. The positive results obtained in practice are often hard to justify from the theoretical point of view, and this apparent contradiction between non-convex landscapes and the good performance of simple algorithms is a recurrent problem in machine learning. A successful line of research has studied the geometrical properties of the loss landscape, distinguishing between good minima, which lead to good generalization error, and spurious minima, which are associated with bad generalization error. The results showed that in some regimes, for several problems from matrix completion [1] to wide neural networks [2, 3], spurious minima disappear, and consequently, under weak assumptions [4], gradient descent converges to good minima. However, these results do not explain the numerous observations in which good and spurious minima coexist, yet gradient descent systematically works [5, 6]. In [7] it was theoretically shown that in a toy model, the spiked matrix-tensor model, it is possible to find good minima with high probability in a regime where exponentially many spurious minima are provably present. In [8] it was shown that this is due to the presence of the so-called threshold states in the landscape, which play a key role in the dynamics of gradient flow [9, 10]: at first attracting it, and subsequently triggering the convergence towards lower minima under certain conditions [11, 12]. However, the spiked matrix-tensor model is an unsupervised learning model, and it remained open whether the picture put forward in [7, 8] also holds for learning with neural networks.
- Asia > Middle East > Jordan (0.04)